In this lab, you will experiment with using Principal Components to display the result of clustering on the wholesale customers dataset. You will see that the linear algebra we are learning in class, such as orthonormal matrices and scalar projections, can lead to powerful visualizations of high-dimensional data.
PCA is great for taking a high-dimensional dataset and displaying it in low dimensions. By itself PCA is not always that illuminating, but if you combine PCA with clustering you can frequently see valuable structure in the data. This technique is a great one to have in your bag of tricks as a data scientist.
Complete the questions marked in \(\color{blue}{Exercise}\) and turn in by uploading to LMS. Start the lab by saving the master .Rmd file to your local directory, editing the header to include your name, setting your working directory, and then executing the lab. Be sure to refer back to the prelab and prior labs as a guide for the exercises.
This dataset looks at the spending habits of consumers.
wholesale_customers_data.csv can be found in the ~/MATP-4400/data directory.
Here is a description of the features in the dataset:

- Fresh: annual spending on fresh products (Continuous)
- Milk: annual spending on milk products (Continuous)
- Grocery: annual spending on grocery products (Continuous)
- Frozen: annual spending on frozen products (Continuous)
- NonFood: annual spending on detergents and paper products (Continuous)
- Delicatessen: annual spending on delicatessen products (Continuous)

To prepare the data, we read in the wholesale customers data, REMOVE THE FIRST TWO COLUMNS, and then center and scale the data using the ‘scale’ command. We save the scaled data in scaled.df to use for your analysis. Note that the ‘scale’ and ‘as.numeric’ commands are included in the ‘mutate_all’ command, which says to apply them to all the columns of the selected data.
library(dplyr) # provides %>%, select, and mutate_all
raw.df <- read.csv("~/MATP-4400/data/wholesale_customers_data.csv")
scaled.df <- raw.df %>%
  select(Fresh, Milk, Grocery, Frozen, NonFood, Delicatessen) %>%
  mutate_all(scale) %>%
  mutate_all(as.numeric)
# Convert to matrix using data.matrix (this keeps column names)
D.matrix <- data.matrix(scaled.df)
head(scaled.df) # check that the data has the correct format
## Fresh Milk Grocery Frozen NonFood Delicatessen
## 1 0.05287300 0.52297247 -0.04106815 -0.5886970 -0.04351919 -0.06626363
## 2 -0.39085706 0.54383861 0.17012470 -0.2698290 0.08630859 0.08904969
## 3 -0.44652098 0.40807319 -0.02812509 -0.1373793 0.13308016 2.24074190
## 4 0.09999758 -0.62331041 -0.39253008 0.6863630 -0.49802132 0.09330484
## 5 0.83928412 -0.05233688 -0.07926595 0.1736612 -0.23165413 1.29786952
## 6 -0.20457266 0.33368675 -0.29729863 -0.4955909 -0.22787885 -0.02619421
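As a quick check of what ‘scale’ does, here is a minimal sketch on a tiny synthetic data frame (hypothetical values, not the wholesale file): each scaled column ends up with mean 0 and standard deviation 1.

```r
# Sketch on synthetic data: scale() centers each column to mean 0 and
# rescales it to standard deviation 1.
toy.df <- data.frame(Fresh = c(10, 20, 30, 40),
                     Milk  = c(1, 3, 5, 7))
toy.scaled <- as.data.frame(lapply(toy.df, function(col) as.numeric(scale(col))))
colMeans(toy.scaled)     # both essentially 0
apply(toy.scaled, 2, sd) # both 1
```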
We run PCA and use the summary and plot commands to see how successful the PCA was.
my.pca<-prcomp(scaled.df,retx=TRUE) # Run PCA and save to my.pca
plot(my.pca, type="line")
summary(my.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.6263 1.3048 0.8603 0.75082 0.53449 0.2509
## Proportion of Variance 0.4408 0.2838 0.1233 0.09396 0.04761 0.0105
## Cumulative Proportion 0.4408 0.7246 0.8479 0.94189 0.98950 1.0000
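The “Proportion of Variance” and “Cumulative Proportion” rows above can be reproduced directly from the standard deviations returned by prcomp. A sketch on synthetic data (the wholesale csv may not be available on every machine):

```r
# Sketch: reproduce the variance rows of summary(prcomp(...)) from $sdev.
set.seed(1)
X <- scale(matrix(rnorm(600), nrow = 100, ncol = 6)) # synthetic stand-in data
pca <- prcomp(X)
prop.var <- pca$sdev^2 / sum(pca$sdev^2) # "Proportion of Variance" row
cum.var  <- cumsum(prop.var)             # "Cumulative Proportion" row
```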
What proportion of variance is explained by the first two principal component vectors?
How many vectors does it take to explain at least 90% of the variance?
If the proportion of variance explained by the first few vectors is high, then score plots and biplots can provide a good summary of the data. A score plot shows the scalar projections of the data on two selected principal components. A biplot is a score plot with additional vectors plotted that correspond to the unit vectors for each feature. These vectors represent the importance and direction of each feature in determining the scalar projections. They are created by plotting the scalar projections of the unit vectors. You may need to zoom in to see them.
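To make the last point concrete, here is a sketch (synthetic data) showing that projecting a feature’s unit vector onto the principal components simply recovers that feature’s row of the rotation matrix, which is what the biplot arrows display:

```r
# Sketch: the biplot arrow for feature j is the scalar projection of the
# j-th unit vector, i.e. row j of the rotation matrix V.
set.seed(3)
X <- scale(matrix(rnorm(600), nrow = 100, ncol = 6)) # synthetic stand-in data
pca <- prcomp(X)
e1 <- c(1, 0, 0, 0, 0, 0)           # unit vector for the first feature
arrow1 <- drop(e1 %*% pca$rotation) # same as pca$rotation[1, ]
all.equal(arrow1, pca$rotation[1, ])
```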
This is the biplot for the first two principal components. The number of each observation is shown on the biplot. Note that if you knit this to html, the biplot is made interactive using the ‘ggplotly’ command. If you knit this to pdf, it will be just a normal biplot.
# Make the biplots of the first two components
plot1<-ggbiplot(my.pca,choices=c(1,2),
labels=rownames(scaled.df), #show point labels
var.axes=TRUE, # Display axes
ellipse = FALSE, # Don't display ellipse
obs.scale=1) + # Keep original scaling
ggtitle("Consumer Data Projected on PC1 and PC2 ")
if (out_type=="latex") {plot1} else {ggplotly(plot1)}
Note that these plots show a big blob of data with relatively few points far away from the blob (these may be outliers). This tells us that some consumers have unique buying habits (e.g. 184 and 86). We can see that point 86 has a low scalar projection on PC1 and a relatively high scalar projection on PC2, while 184 has low scalar projections with respect to both principal components.
If we combine PCA with kmeans (or your favorite clustering method), we can get an even richer picture of consumer patterns.
First let’s examine the parts of PCA returned by prcomp using ‘str’. V<-my.pca$rotation contains the eigenvectors/principal components. scores<-my.pca$x contains the scalar projections on each principal component (usually called scores).
str(my.pca)
## List of 5
## $ sdev : num [1:6] 1.626 1.305 0.86 0.751 0.534 ...
## $ rotation: num [1:6, 1:6] -0.0429 -0.5451 -0.5793 -0.0512 -0.5486 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "Fresh" "Milk" "Grocery" "Frozen" ...
## .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:6] -3.39e-17 -3.96e-18 -5.51e-17 3.65e-17 3.33e-17 ...
## ..- attr(*, "names")= chr [1:6] "Fresh" "Milk" "Grocery" "Frozen" ...
## $ scale : logi FALSE
## $ x : num [1:440, 1:6] -0.193 -0.434 -0.81 0.778 -0.166 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
V<-my.pca$rotation
V
## PC1 PC2 PC3 PC4 PC5 PC6
## Fresh -0.04288396 -0.52793212 -0.81225657 -0.23668559 0.04868278 0.03602539
## Milk -0.54511832 -0.08316765 0.06038798 -0.08718991 -0.82657929 0.03804019
## Grocery -0.57925635 0.14608818 -0.10838401 0.10598745 0.31499943 -0.72174458
## Frozen -0.05118859 -0.61127764 0.17838615 0.76868266 0.02793224 0.01563715
## NonFood -0.54864020 0.25523316 -0.13619225 0.17174406 0.33964012 0.68589373
## Delicatessen -0.24868198 -0.50420705 0.52390412 -0.55206472 0.31470051 0.07513412
scores<-my.pca$x
head(scores)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -0.1930708 0.3047531 -0.14071827 -0.4858785 -0.4947183 0.007405709
## [2,] -0.4339260 0.3280392 0.31864390 -0.1786270 -0.3651636 -0.054509798
## [3,] -0.8102210 -0.8141689 1.52168348 -1.2526556 0.3786225 0.277223012
## [4,] 0.7777625 -0.6520115 0.16282692 0.3796280 0.2758236 -0.060648502
## [5,] -0.1660982 -1.2699881 0.06620403 -0.8252873 0.3937624 0.026794141
## [6,] 0.1559924 0.2948054 0.14744411 -0.4178127 -0.4789097 0.053878989
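A sketch (on synthetic stand-in data) of how these two pieces fit together: the score matrix my.pca$x equals the centered data matrix times the rotation matrix V. The same check works on D.matrix with my.pca.

```r
# Sketch: the scores ($x) equal the centered data times the rotation matrix.
set.seed(2)
X <- scale(matrix(rnorm(600), nrow = 100, ncol = 6)) # synthetic stand-in data
pca <- prcomp(X)
recomputed <- X %*% pca$rotation
max(abs(recomputed - pca$x)) # effectively zero
```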
# Insert your answer here
# Insert your answer here
Compute the scalar projections of consumer 86 on each of the eigenvectors by multiplying that point by the eigenvectors.
Verify that these are the same as my.pca$x[86,] by computing the difference between your calculation and my.pca$x[86,] and summarizing the differences found.
# Insert your answer here
Explain mathematically how sample 86’s eating habits result in a low scalar projection for PC1. How do the red feature axes show you roughly the same facts?
What about sample 86’s eating habits makes it have a relatively high scalar projection with respect to PC2? How do the red feature axes show you roughly the same facts?
# Insert your answer here
set.seed(20)
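The lab’s own kmeans call on scaled.df is left for you to write; as a reminder of the workflow, here is a minimal sketch on synthetic data (all names and values here are hypothetical):

```r
# Sketch of the kmeans workflow on synthetic data: cluster, count the points
# per cluster, and look up the cluster of a particular row.
set.seed(20)
X <- matrix(rnorm(120), nrow = 20, ncol = 6) # synthetic stand-in data
km <- kmeans(X, centers = 3, nstart = 25)
table(km$cluster) # number of points in each cluster
km$cluster[5]     # cluster containing, e.g., row 5
```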
Look at the number of points in each cluster. Find the cluster that contains sample number 86. Let’s call this Cluster A. How many points are in cluster A?
Draw a heatmap (not scaled) of the cluster centers found by kmeans.
# Insert your answer here
Redraw the heatmap using the heatmap.2() command with dendrograms but with no scaling. Add a vertical side bar that shows the cluster each row belongs to by using the RowSideColors argument to heatmap.2(), such as: RowSideColors = as.character(km$cluster) (see RowSideColors in the prior lab). Make sure to title the heatmap.
# Insert your answer here
# Insert your answer here
# Insert your answer here
END OF LAB 4